CAGEF_services_slide.png

Introduction to Python for Data Science

Lecture 03: Reading, writing, and wrangling data

Student Name: Live Lecture File

Student ID: 220127


0.1.0 About Introduction to Python

Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

The structure of this course is a code-along style; it is 100% hands-on! A few hours prior to each lecture, the materials will be available for download on Quercus and will also be distributed via email. The teaching materials consist of a Jupyter Lab notebook with concepts, comments, instructions, and blank spaces that you will fill out with Python code along with the instructor. Other teaching materials include an HTML version of the notebook and datasets to import into Python when required. This learning approach will let you spend your time coding rather than taking notes!

As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark).

0.1.1 Where is this course headed?

We'll take a blank-slate approach to Python here and assume that you know pretty much nothing about programming. From the beginning of this course to the end, we want to take you from some potential starting scenarios:

and get you to a point where you can:


0.2.0 Lecture objectives

Welcome to this third lecture in a series of seven. Today we will pick up where we left off last week with DataFrames. Now that we know some of the basics regarding this data structure, we can drill into using them to properly format your data for further analyses.

At the end of this lecture we will aim to have covered the following topics:

  1. Wide vs. long-format data and the Tidy data philosophy.
  2. Importing Excel spreadsheets.
  3. Data wrangling.
  4. Exporting wrangled data.

0.3.0 A legend for text format in Jupyter markdown

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn Python
Red boxes: A comprehension question which may or may not involve a coding cell. You usually find these at the end of a section.

0.4.0 Data used in this lesson

Today's lesson uses two datasets to practice importing and wrangling data with pandas.

0.4.1 Dataset 1: miscellaneous.xlsx

An Excel workbook of five sheets, where each sheet is a dataset. We will use this workbook to demonstrate how to import Excel files into Python using pandas.

0.4.2 Dataset 2: The Human Microbiome Project

This is a dataset resulting from phase 1 of the Human Microbiome Project (HMP). The HMP was a joint effort led by the US National Institutes of Health (NIH) to characterize the microbiome of healthy human subjects at five major body sites using metagenomics (a.k.a. environmental DNA) and 16S ribosomal RNA gene amplicon sequencing. The dataset we will use contains the results of 16S rRNA gene sequencing, a technique in which this gene is used as a marker for taxonomic identification and classification.

In very general terms, the microbial DNA is extracted from a sample, the 16S rRNA gene is amplified using the polymerase chain reaction (PCR), the gene amplicons are sequenced, and the resulting sequences are matched against a reference database. The results are presented as counts of Operational Taxonomic Units (OTUs), which for data interpretation are roughly equivalent to microbial species.

There are two datasets located in the data folder which we're interested in:

These files, however, are excessively large - especially the metadata file. Therefore we'll be using subsets of those files called:


0.5.0 Packages used in this lesson

IPython and InteractiveShell will be accessed just to set the behaviour we want for IPython so we can see multiple code outputs per code cell.

random is a package with methods to add pseudorandomness to programs

numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.

os, pandas, and matplotlib round out the packages used in this lecture.
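As a point of reference, a minimal setup cell for this notebook might look like the sketch below; the InteractiveShell line is the display tweak described above, and the rest are plain imports.

```python
# Minimal setup sketch: imports plus the IPython display behaviour described above.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"   # show every expression's output, not just the last one

import os
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```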


1.0.0 The long and the wide of it - formatting your data for analysis

1.0.1 Defining our data terms

1.1.0 Wide versus long format

Wide and long (sometimes called un-stacked and stacked, wide and tall, or wide and narrow) are terms used to describe how a dataset is formatted.

In a long formatted dataset, each column is a variable and the results of each measured variable are stored in rows (observations). In contrast, not every column in wide formatted data is necessarily a variable so you can have several observations of the same type of variable in the same row. The names long and wide come from the general shape that the two data formats have.

For data science applications, long format is preferred over wide format because it allows for easier and more efficient computations, data subsetting and manipulation. Wide format is more friendly to the human eye and easier to work with when data needs to be manually recorded/input. Therefore, having the ability to interconvert between these two data formats is a valuable and required skill. The following is a general scheme of wide- (left) and long-format (right) datasets:

wide_and_long_formats.png
While more readable and technically more compact, the wide data format is not easily parsed for data analysis compared to the long data format.
Why can't computers think like we do? To read more about wide and long formats, visit this blog post on spreadsheet thinking versus database thinking.

1.2.0 The tidy data philosophy

Why tidy data?

Data cleaning (or dealing with 'messy' data) accounts for a huge chunk of a data scientist's time. Ultimately, we want to get our data into a 'tidy' format (long format) where it is easy to manipulate, model and visualize. Having a consistent data structure, and tools that work with that data structure, can help this process along.

Tidy data has:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

This seems pretty straightforward, and it is. It is the datasets you get that will not be straightforward. Having a map of where to take your data is helpful for unravelling its structure and getting it into a usable format.

The 5 most common problems with messy datasets are:

Fortunately there are some tools available to solve these problems.

Observational units: Of the three rules, the idea of observational units might be the hardest to grasp. As an example, you may be tracking a puppy population across 4 variables: age, height, weight, fur colour. Each observational unit is a puppy. However, you might be tracking these puppies across multiple measurements - so a time factor applies, and the observational unit becomes puppy-time. In that case, the puppy-time measurements belong in a different table (at least by tidy data standards). This, however, is a simple example, and things can get more complex when taking into consideration what defines an observational unit. Check out this blog post by Claus O. Wilke for a little more explanation.

1.3.0 Reading in data

1.3.1 Looking closer at our data

In our case, we have two datasets ready in our data folder to import for this lecture:

1.3.1.1 hmp_data

1.3.1.2 hmp_taxa_metadata

1.3.2 Our data-wrangling goal

tell-me-more-about-your-data-wrangling-course_small.jpg
More than 50% of your time may be spent dealing with the shape and format of your data. Once that is figured out, the analysis can seem rather simple!

We'll begin by learning how to import different data formats. Next we will use pandas to introduce concepts of data wrangling and convert our datasets from wide to long format. When we are ready, we will merge the two files into a single dataset that we can use for exploratory data analysis.

These datasets are part of the R package HMP16SData. Details on how to prepare the data for download can be found in the HMP16SData vignette. To read more about the HMP project, visit https://www.hmpdacc.org/.

Data vs metadata? The distinction between data and metadata can be confusing. Both sets are considered data, but the former usually refers to measurements that may be raw or unfiltered; these measurements may themselves contain metadata about their samples. Metadata, on the other hand, has always been processed in some way and holds information that is relevant to and describes the measured data. Simply put - data provides content, while metadata provides context. For a brief comparison check out these brief descriptions.

2.0.0 Working with Excel spreadsheets

Given the popularity of Excel, it is common to find ourselves working with Excel workbooks. Before importing the file, we need to know our current working directory (where in the computer are we working from?), change the directory if necessary, and list the files available for import. For those purposes we will use the built-in os library.

2.1.0 The os package provides access to the operating system

In order to access functions and files in your operating system, the os package provides utilities that facilitate navigating through directories and accessing system information. Here's a table with some helpful functions and their description.

Function Description
getcwd() Retrieve the current working directory. This is the location where Python thinks you're working from. This is usually the directory holding your Jupyter notebook.
chdir() Change the current working directory. This can be an absolute path (ie C:/Users/yourname/) or relative to where you currently are (ie data/)
listdir() List the files and folders in the current directory (default) or in the directory specified as a parameter.
mkdir() Make a new directory (aka folder) in the current directory (default) or by providing a relative or absolute path. Note: you will get an error if attempting to make a directory that already exists.
rename() Rename a folder or file. You can even move things to new directories as long as the destination path exists.
rmdir() Remove an empty directory; gives an error if the directory does not exist or is not empty.
remove() Remove a file (path) as long as it exists.
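A few of these in action (the paths are placeholders; adjust them to your own machine):

```python
import os

os.getcwd()             # where is Python working from right now?
os.listdir()            # files and folders in the current directory
os.listdir("data")      # or list the contents of the data/ folder
# os.chdir("data")      # move into the data/ folder (relative path)
# os.mkdir("results")   # make a new folder; errors if it already exists
```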

2.2.0 Use pandas read_excel() to import your data

So we have covered an important series of os-based commands that are helpful for moving and renaming our data files. Sometimes the data itself might have issues, which is all part of data wrangling. To simplify our process, however, the pandas package already provides a function that deals with headers and other issues that can appear when importing .xlsx files - and yes, it runs openpyxl under the hood (see Appendix 1, section 6.0.0).

By default read_excel() will use the first sheet in the workbook, but we can also use the sheet_name parameter with the actual sheet name or its sheet position. The resulting imported data will be stored in a pandas.DataFrame.
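A sketch of the basic call, assuming the workbook lives in the data/ folder used throughout this lesson (the sheet name "Sheet1" is a placeholder):

```python
import pandas as pd

misc_default = pd.read_excel("data/miscellaneous.xlsx")                       # first sheet by default
misc_by_name = pd.read_excel("data/miscellaneous.xlsx", sheet_name="Sheet1")  # by sheet name
misc_by_pos = pd.read_excel("data/miscellaneous.xlsx", sheet_name=2)          # by position (third sheet)
```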


2.2.1 Use pandas ExcelFile() class to get a list of sheet names

Although we can read in sheets by position, it is hard to call on a specific sheet by name unless we know the sheet names a priori. This can be done quite easily with the openpyxl package (see Appendix 1, section 6.0.0), but we can also initialize an ExcelFile object, which has the .sheet_names property.

In fact, over the years, the function ExcelFile.parse() has become virtually identical to the read_excel() function if we were to take a look under the hood.

You can also select multiple specific sheets to read in after you've created an ExcelFile object.
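For instance, something along these lines (file path assumed as above):

```python
xl = pd.ExcelFile("data/miscellaneous.xlsx")
xl.sheet_names                              # the list of sheet names in the workbook

first_sheet = xl.parse(xl.sheet_names[0])   # parse() behaves essentially like read_excel()
```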


2.2.2 Read multiple sheets into a dictionary of DataFrame objects

Suppose you wanted to select multiple sheets. The read_excel() function will parse a list [ ] of sheet indices (zero-indexed, of course!) and names, storing the sheets as entries in a dictionary object whose keys are the values given to sheet_name. With this parameter you can:

  1. Mix indices and sheet names in your list!
  2. Use None to open all sheets in your file
Be careful what you ask for: When using the option to identify sheets by their index, you must remember that these will take the place of your actual sheet name as a dictionary key.
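A sketch of both options (the mixed list uses a placeholder sheet name):

```python
all_sheets = pd.read_excel("data/miscellaneous.xlsx", sheet_name=None)          # every sheet
some_sheets = pd.read_excel("data/miscellaneous.xlsx", sheet_name=[0, "taxa"])  # "taxa" is a placeholder name

some_sheets.keys()   # the dictionary keys are exactly what you asked for: 0 and "taxa"
```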
taken_readr.png
While there are many options for opening excel files, importing directly with Pandas is most convenient. However, other file types will require different methods.

Section 2.0.0 Comprehension Question: Import the 5th, 2nd, and 3rd sheets (in this order) from the file miscellaneous.xlsx into a dictionary object using their sheet names as the key.

3.0.0 Importing comma-separated values (CSV) files

Next, we will import the HMP files we discussed at the top of class and reformat them in such a way that we can join the two files into a single, long-format pandas DataFrame. The image below is a screenshot of the two files that make up the HMP dataset: one file with metadata and OTU counts, and one with the OTU-taxa information. How would you do the joining? What could we use as a key to join the two datasets? (Think of this question as "what do the two datasets have in common?".)

Importing and exploring your dataset, and designing a data wrangling strategy, is something that you should always do before you start reshaping your dataset.

Let's take a quick minute to explore the files using Excel and to propose a data wrangling strategy or even start reshaping your data. Avoid scrolling down for now as the answers are in this notebook. Also, make sure to make copies of the two files and use those for your exploration. (Data inspection is also done in Python, of course, but for the sake of this exercise, use Excel.)

datasets_screenshot.png

It's a bit hard to see but here's where we want to be at the end. Remember our end objective will be to merge these two datasets into a single long-format dataset.

merged_dataset.jpg

We'll talk in more detail about how to get there but let's talk general strategy first.


3.0.1 A general strategy for data wrangling

Here is a general overview on how to convert the files into long format and join them into a single data frame:

  1. Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset?

  2. Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it.

  3. Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed.

  4. Convert your dataset into long format to help conform with tidy data principles.


3.1.0 Data wrangling with Pandas

Data preprocessing, a.k.a. data wrangling or data munging, is the phase in which the data is prepared for downstream data analysis. It involves data cleaning, formatting, handling of special characters in datasets, etc. Preprocessing datasets can be labour intensive depending on the initial state of the data, as we will see today, but Pandas simplifies this process to a great extent.

"Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with 'relational' or 'labeled' data both easy and intuitive."

Read more: You can find the pandas package documentation here.

Let's get down to it. First, import Pandas and Numpy

3.1.1 Import our CSV file with read_csv()

Now we are all set up to tackle file 1: human_microbiome_project_metadata_subset.csv. This file contains the OTU counts and metadata (data about the data). We'll import the data with the pandas read_csv() function, which produces a DataFrame object from the import. You may recall this function from last lecture, but some useful parameters include:
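A sketch of the import, assuming the file sits in the data/ folder; the na_values setting is explained just below.

```python
hmp_metadata = pd.read_csv(
    "data/human_microbiome_project_metadata_subset.csv",
    na_values=[" "],   # also treat a lone space as a missing value
)
hmp_metadata.head()
```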


3.1.2 Missing values are replaced with NaN

Taking a cursory glance at the DataFrame, we can see the Unnamed column we mentioned in our strategy, but we also see that the 8th column, SRS_SAMPLE_ID, has a number of NaN values. It appears that some information is missing in that column, so it was imported with a value of NaN - perhaps because some samples were never submitted to the NCBI database, or for some other reason - and we will take no action.

We should note that by default read_csv() and similar commands will recognize not only empty space ('') but other potential variants of empty values like NA, N/A etc and give these the NaN value. Furthermore, from our above command, you can set additional values to be recognized as NaN using the na_values parameter. In our case, we set a single space ' ' as a recognizable missing value. You can even set a dictionary of missing value characters specific to each column. We'll talk more about missing data soon.

Read more: Find more information about the import functions in pandas here

For now, let's take a closer look at the SRS_SAMPLE_ID column. Recall we can index columns by name (.loc[:, col_name]) or by position (.iloc[:, position]).
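For example (the column position is illustrative; use the one info() reports for your import):

```python
hmp_metadata.loc[:, "SRS_SAMPLE_ID"].head()   # by name
hmp_metadata.iloc[:, 8].head()                # by position (here assumed to be column index 8)
```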


3.2.0 Data(Frame) inspection using methods and properties

As mentioned before, inspecting your data to understand aspects such as types, missing values, and shape, is a crucial part of data preprocessing. Recall that methods and properties like info(), head(), and .shape, are valuable starting points.
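A quick reminder of those calls, using the DataFrame imported above:

```python
hmp_metadata.shape     # (number of rows, number of columns)
hmp_metadata.info()    # column names, dtypes, and non-null counts
hmp_metadata.head()    # first five rows
```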


3.2.1 Check for NaN values with the isna() and any() methods

Recall that NaN values can represent any number of potential problems with your data. Although we quickly found NaN values in one of our columns using just a cursory inspection, there may be other NaN values imported that we haven't seen or expected. We can check for more NaN values with the isna() DataFrame method, which will return a same-sized object where any NaN or similar values are mapped to the value True.

To accomplish this task we'll also apply the DataFrame.any() method in-line with our isna(). The any() function is a core Python function that checks whether any of the elements within an iterable (such as a list) have the value True. In the case of the DataFrame class, the any() method will evaluate along the specified axis. The any() method parameters include:

We'll also revisit the .columns property from last lecture to return an Index object containing all of the column names.

Do I use isna() or isnull()? You may recall from lecture 2 that we used the isnull() method to identify which elements were NaN or NA. What is the difference between these two functions? Nothing. You'll find the documentation is identical as Pandas really only has NaN values. However, historically the ideas in Pandas are built upon the Dataframe object from the programming language R which distinguishes between NA and NULL values.

Now that we're familiar with the methods, we can ask: "How many columns contain NaN?"
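One way to answer that, sketched with the DataFrame name used above:

```python
col_has_nan = hmp_metadata.isna().any()   # one boolean per column (axis=0 is the default)
col_has_nan.sum()                         # True counts as 1, so this is the number of columns with NaN
```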


3.2.2 Use a boolean list to retrieve columns with NaN values.

As you can see above, there remains a slight issue with our assessment of the metadata - there are still some 2,000 columns to look through. That's essentially a metaphorical needle in a haystack! You need to narrow down your search by thinking smarter, not working harder.

Recall you can supply a list of booleans to certain data structures in order to retrieve their elements. Only True values return the specific element at that index. In this case we'll index/filter our column names based on their isna().any() results!
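In other words, something like:

```python
# Keep only the column names whose isna().any() result is True.
hmp_metadata.columns[hmp_metadata.isna().any()]
```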


3.2.3 Use the tolist() method to convert your objects

From above we see that all of our code creates an Index object, which is part of the pandas package. It has some extra properties and methods that might be useful, but we can also simply convert it to a simpler Python list if all we care about is the column names. To accomplish this we'll use the pandas method tolist(), which will iterate through the object and produce a simple list of Python scalars, or pandas scalars for more complex values like Timestamps or intervals.

Food for thought: We could also use the list() function BUT things can get more complicated when working with nested lists or nested structures. The list() function was built as part of Python and doesn't necessarily have detailed behaviours for handling optional packages like pandas. Therefore, it is best to use the methods and functions provided (when available) by a package to coerce or re-cast objects.
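Continuing from the column check above, the conversion is a one-liner:

```python
nan_columns = hmp_metadata.columns[hmp_metadata.isna().any()].tolist()
nan_columns   # now a plain Python list of column names
```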

3.2.4 .loc[] and .iloc[] can accept boolean lists as well

So we have a way to get a list of columns with NaN values. In fact we can use similar methods to retrieve column names for any columns that meet conditional criteria we are interested in. Suppose we wanted to cull a DataFrame down to the information we are explicitly interested in?

While we discussed the uses of loc[] and .iloc[] in the context of having column names and index positions, both of these methods can accept boolean series/arrays/lists as a means to filter for specific columns. Remember that multi-indexing uses a [row, column] format so how do we get just our columns with NaN data?
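One possible answer, keeping every row but only the flagged columns:

```python
hmp_metadata.loc[:, hmp_metadata.isna().any()]   # all rows, only the columns containing NaN
```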


3.2.5 Additional value-checking mechanisms

Recall from last week that the isnull() function has an opposite function, notnull() which essentially flips the boolean values from isnull(). Likewise, the notna() function produces the same result.


3.2.6 Use method chaining to reduce intermediate variables

What did we just do in the above code? While we've been encountering some similar examples in this section, we haven't really addressed this coding style. One of the great things about working with an object-oriented language like Python is that we can call methods in tandem, also known as method chaining. Since many methods and properties return an object, as long as we know what kind of object was returned, we can stack method calls on the result. This is especially helpful in the pandas package, where core functions are often also present as class methods.

From above:

  1. hmp_metadata.notna() returns a DataFrame upon which we call...
  2. DataFrame.any() returns a Series upon which we call...
  3. Series.sum() returns a scalar in this case

Of course you need to know which objects you are dealing with and their methods but it makes a pretty simple line of code!
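The chain described above, written out in one line:

```python
hmp_metadata.notna().any().sum()   # number of columns containing at least one non-NaN value
```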

Reduce intermediate variables: by using method chaining, we not only simplify our code but also reduce intermediate variables. If we had used simpler programming methods, we would need to produce a temporary variable to hold the output in each step outlined above before getting a final answer. This may not be a big problem in some cases but when we have no need for the intermediate results, method chaining keeps the process clean in our code, and in program memory.

3.3.0 Subsetting your DataFrame

Now that we've explored our metadata file a little bit and know that there are some missing values in the SRS_SAMPLE_ID column, we can proceed to the next step of our journey: reshaping the data. Recall that we currently have some ~2000 columns representing operational taxonomic units (OTUs) and a quantification of how many of those units are present within a sample.

3.3.1 Before reshaping hmp_metadata take a subset

A recap on the formatting to be done on the metadata file:

  1. Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset? $\checkmark\checkmark$

  2. Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it.

  3. Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed.

  4. Convert your dataset into long format to help conform with tidy data principles.

As mentioned previously, preprocessing entire large datasets is a rather inefficient strategy - not so bad if you have access to vast computational power, but still inefficient. There is quite a bit of trial and error involved in finding out what works best on your data, and the larger the dataset, the longer it takes for the output to be generated so you can inspect and evaluate it.

Instead, we take a representative sample of the full dataset, write code to preprocess it, and then apply the program we wrote to the full dataset. For example, the full hmp_metadata file has 43,149 rows and 2,898 columns, where columns from index 9 onwards (zero-indexed) are OTUs. Four OTU columns will be enough for developing our data-wrangling code, so we can ignore the remaining 2,885 OTU columns for now. The easiest way to subset data would be using iloc[] or loc[].
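For example, a fixed slice keeping the nine information columns plus the first four OTU columns might look like this (the exact number of OTU columns kept is up to you):

```python
hmp_metadata_sub = hmp_metadata.iloc[:, :13]   # columns 0-8 are sample info, 9-12 are four OTU columns
hmp_metadata_sub.shape
```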


3.3.2 Subset your data randomly with Numpy's random.choice() function

We could simply take a fixed slice of our data, but we can also take a random subset that is representative of our dataset. Taking random subsets is especially helpful if you are running certain tests on data for machine learning or programming/testing new algorithms.

To accomplish our subsetting goals, we will use Numpy's random.choice() function to generate a representative subset of hmp_metadata. Of course we'll still want those first 9 columns - it's the OTU data that we want to randomize.

Random seed (pseudo-random number generator): A function used to ensure reproducibility when a program involves randomness. It is used in processes such as random number generation or sampling from a dataset.

Now that we know more about random.choice, let's use it. Given that the a argument expects a 1-D array or an integer (see random.choice's documentation), we will use the shape attribute.
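A sketch of that idea; the subset sizes and the seed are illustrative choices, not values from the lecture:

```python
np.random.seed(42)                    # fix the pseudo-random sequence for reproducibility

n_rows, n_cols = hmp_metadata.shape   # .shape gives (rows, columns)
row_idx = np.random.choice(n_rows, size=100, replace=False)   # 100 random row positions
col_idx = np.random.choice(n_cols, size=4, replace=False)     # 4 random column positions
```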

Done. Subsampling from both rows and columns seems to be working as expected, so now we can create a subset. How do we use those two lines of code to sample from hmp_metadata without ending up with booleans for a final output?


3.3.3 Make a range of numbers with np.arange()

Not quite what we wanted... We did not catch the first nine columns and we need those. On a side note, seeing only OTU columns is simply a matter of probability as there are thousands of OTU columns versus 9 non-OTU columns. Remember we are merely giving .iloc[] a list of index positions. All we have to do is provide a static and random portion and put them together.

To achieve this we'll use the NumPy arange() function. No, that's not a typo. Python 3 already has range(), which returns a lazy, generator-like range object covering the span specified. NumPy's version returns an array for the range supplied, from start (inclusive) to end (exclusive).

What is a generator? While we will cover this in more detail in our flow control lecture, you should know that generators are functions that produce a stream of values that can be iterated through when supplied to the right kind of Python functions.

To put two arrays together, we use the np.append() function. Let's give it a try.
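Continuing the sketch above, the static and random pieces can be combined like so:

```python
info_cols = np.arange(9)                                                   # positions 0-8: the non-OTU columns
otu_cols = np.random.choice(np.arange(9, n_cols), size=4, replace=False)   # 4 random OTU column positions
keep_cols = np.append(info_cols, otu_cols)                                 # static part + random part

hmp_metadata_sub = hmp_metadata.iloc[row_idx, keep_cols]
hmp_metadata_sub.shape
```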


3.3.4 Use the DataFrame sample() method as an alternative to np.random.choice()

An alternative to random.choice() for random subsampling of either rows or columns is the sample() method. Using the parameters you can decide on:
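For example (sizes and seed again illustrative):

```python
hmp_metadata.sample(n=100, random_state=42)         # 100 random rows (axis=0 is the default)
hmp_metadata.sample(n=4, axis=1, random_state=42)   # 4 random columns instead
```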


3.4.0 Generate summary information for our data subset

Before we continue let's look deeper at our data subset to see what our data is like. We could have done this with our larger dataset but it would likely be overwhelming. We can use properties like .dtypes to find out the data types for our columns, and the method describe() to produce summary statistics for those columns too.
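Using the subset created above:

```python
hmp_metadata_sub.dtypes       # the data type of each column
hmp_metadata_sub.describe()   # summary statistics for the numeric columns
```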


3.4.1 The describe() method works for numeric information

From above, we see that our summary statistics do not include the SEX, RUN_CENTER, HMP_BODY_SITE, HMP_BODY_SUBSITE, and SRS_SAMPLE_ID columns. Given that these are character-based, the describe() method cannot generate any summary statistics for these columns.

If, however, we explicitly provide non-numeric column(s) to describe(), we are returned a truncated or altered version of the output. You can provide a single character column or several, with similar results.


3.4.2 Use the unique() method to examine values in your DataFrame

As we can see from above, there are 18 unique HMP_BODY_SUBSITE values. We can see what they are using the unique() method on a column from our DataFrame. This is a quick way to see the nature of values in your dataset as well. Let's give it a try.
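Applied to our subset, that looks like:

```python
hmp_metadata_sub["HMP_BODY_SUBSITE"].unique()   # the distinct body-subsite labels
```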


3.4.3 Retrieve your DataFrame values as a NumPy array with the .values property

As another way to access some or all of the values in your DataFrame, you can use the .values property, which you can also subset with multi- or chained indexing.


3.5.0 Reshaping your data

So we've discussed a few ways to generate subsets from our data using numpy.random.choice() and pandas.DataFrame.sample(). This is helpful when generating code to reshape our data because the original information can be extremely dense and, in some cases, computationally intensive to just "play" with. We've also generated some summary information about our subsets, just to help us sanity-check for larger issues we may have overlooked when initially importing the data.

Let's see where we are in our steps:

  1. Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset? $\checkmark\checkmark$

  2. Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it. $\checkmark\checkmark$

  3. Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed.

    • In the current case we want to remove the Unnamed: 0 column. It looks to be an artefact from when the file was originally exported. It's not a column we need and we should remove it.
  4. Convert your dataset into long format to help conform with tidy data principles.

Up until this point we haven't really altered any of the data we imported. As per our data wrangling plan, however, the next step is to remove column 0 from hmp_data. It is just a column with indices starting at 1 that was created when the file was written to disk (saved).

Before we start deleting anything from the dataset, inspect the dataset and pull out a list with column names. The fastest way to retrieve a list of column names is to cast the DataFrame as a list().
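For instance (hmp_data here is the working DataFrame named in the surrounding text):

```python
list(hmp_data)[:10]   # casting a DataFrame to a list yields its column names
```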


3.5.1 Remove a column using the drop() method

As usual, there are several ways to achieve the same goal. In this case, we want to remove the first column, Unnamed: 0, from our DataFrame. We can remove that column by its index using the drop() method from the DataFrame class. This is accomplished in a single call or one-liner (a nickname for programs that are one line long).

You can complete this code using the index position(s) or the column name(s).
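A sketch of both variants; the positional version is the one the caution below refers to.

```python
# By position - works, but see the warning below:
hmp_data = hmp_data.drop(hmp_data.columns[0], axis=1)

# Safer: name the column explicitly.
# hmp_data = hmp_data.drop(columns=["Unnamed: 0"])
```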


Removing by index can be dangerous! Even though the above code does its job as expected, it is very dangerous. Since it uses indexation to locate the column we wish to remove, once the current column 0 is removed, all other columns will be shifted one position to the left. As a consequence you will now have another column at index 0, and if you run this piece of code again, it will continue removing whatever column is present at index 0. You can end up removing many more columns than you intended, by mistake, with potential serious negative implications for your analyses. Thus, it is safer to use a syntax that explicitly looks for the column of interest, like using the full column name or by partial matching (more on that in Lecture 6).

If we try to run this same code again, we get a KeyError because there is no column called "Unnamed: 0" anymore.


3.6.0 Melting your variable columns into just two columns

Now that we've removed the Unnamed: 0 column from our DataFrame, let's see where we are in our steps:

  1. Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset? $\checkmark\checkmark$

  2. Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it. $\checkmark\checkmark$

  3. Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed. $\checkmark\checkmark$

    • In the current case we want to remove the Unnamed: 0 column. It looks to be an artefact from when the file was originally exported. It's not a column we need and we should remove it.
  4. Convert your dataset into long format to help conform with tidy data principles.

    • Our goal is to combine all of the OTU names into a single column and their counts into another. Currently they are spread across multiple columns, with the OTU names as column names and the counts as the values in each OTU column.
    • We'll convert our data through a process known as "melting".

3.6.1 Melting your variable columns with the melt() function

reshaping_melt.png
We convert our wide-format data to long-format using a function like melt() which compresses variable (ie column) names into a variable column, and their associated values into a value column. This image was provided by the Pandas online documentation.

We can "melt" all of the OTU variable values into a single column. This essentially means we are going to compress all of the OTU column information into two columns. The first column holds information like a variable name (ie the OTU code), while the second column holds the value. Thus a new set of observations (rows) is added for each variable (column) we are melting.

Another important aspect of this step is that the unmelted information (untouched columns) in each observation must combine to uniquely identify that observation. In our case, columns 0-7 hold the information that will remain untouched. That data will now be repeated multiple times, once for each variable that is converted to an observation.

We'll accomplish this with the pandas function melt(), or the identical DataFrame method of the same name. The key parameters to keep in mind are:

3.6.2 Use the DataFrame object .melt() method directly

While the pandas package has its own melt() function, you can call the .melt() method directly when working with a DataFrame. Both have their advantages: the melt() function can be used when cycling through a list of DataFrames, for instance, while using the .melt() method directly reduces potential typo issues that could accidentally access the wrong DataFrame variable.
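A sketch of the melt on our working subset; the var_name and value_name labels are illustrative choices:

```python
hmp_data_melt = hmp_data.melt(
    id_vars=list(hmp_data.columns[:8]),   # the first 8 columns stay as identifiers
    var_name="OTU",                       # old column names end up here
    value_name="count",                   # old cell values end up here
)
hmp_data_melt.head()
```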


3.6.3 Provide a list of column names to the melt() function

Can you explain why our melted data frame is 500 rows x 10 columns?

If you wanted to keep a subset of identifier columns that were not strictly grouped together you have a few options for creating that list to pass on to the id_vars parameter of melt():

  1. Make an explicit list of column names
  2. Slice and combine subsets of columns from your DataFrame. Just remember any excluded columns will become variable columns unless you explicitly name those as well!

As you can see above, there are consequences to subsetting by your potential list of id_vars. We now have populated our OTU column with additional undesired information. In this case, you would be better off removing the offending columns first with the .drop() method.


3.7.0 Put your code together now that you've tested it

And... that was the last change to hmp_data. Now that we have the file in the shape we wanted, let's preprocess the entire dataset by combining all of the code snippets we've generated.


3.7.1 Drop intermediate variables with method chaining

As you can see above, while our code is "easier" to read, it's a little inefficient, as we create an intermediate variable hmp_data_drop. We can forego that using method chaining, and this will make a big difference as your wrangling becomes more complicated!

In the following example we'll combine the drop() and melt() steps into a single chained command. We'll also use the \ at the end of each line in our assignment. This will signal to the Python interpreter the continuation of our assignment expression. This makes your code easier to read for yourself and others.
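A sketch of that chain, assuming hmp_data is the freshly imported file that still contains the Unnamed: 0 column; the id_vars slice and the new column names are illustrative:

```python
hmp_data_tidy = hmp_data \
    .drop(columns=["Unnamed: 0"]) \
    .melt(id_vars=list(hmp_data.columns[1:9]),   # the 8 identifier columns after Unnamed: 0
          var_name="OTU",
          value_name="count")
hmp_data_tidy.head()
```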

Section 3.7.1 Comprehension Question: As we've discussed there are many paths to the same result. Update the following code to melt hmp_data by excluding the "SRS_SAMPLE_ID" column from the original dataset as well. This should result in a final DataFrame with one less identifier column but the same number of rows.

data-wrangling-ooooooh-yyyeeeeaaahhh.jpg
One file down. One more to go.

3.8.0 Data wrangling hmp_taxa_meta

While our measurement data has been tidied, the data describing the OTU designations in our dataset sits in a separate metadata file. Before we can unite this metadata with the measurement data, we must also prepare it into a properly formatted wide-format dataset. While this doesn't seem to fit with our previously wrangled long-format data, it will soon make more sense.

3.8.1 Import and explore hmp_taxa_meta

Recall our general steps:

  1. Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset?

  2. Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it.

  3. Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed.

  4. Convert your dataset into long format to help conform with tidy data principles.

Remember that the ultimate goal is to join the metadata and OTU-taxa datasets. Let's start by importing and inspecting the OTU taxa set.


Check for NaN values within our data.


NaNs are much more abundant in this dataset compared to hmp_data. How many columns with NaN values are we dealing with?


3.8.2 Subset hmp_taxa_meta

We've imported and taken a look at our data. So 198 columns contain incomplete data but we'll just leave those for now.

Recall our OTU-taxa file plan:

  1. Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset? $\checkmark\checkmark$

  2. Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it.

    • There are only 7 rows so we'll keep all of those but we can trim how many columns we are taking for now.
  3. Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed.

  4. Convert your dataset into long format to help conform with tidy data principles.

Let's take just a small subset of taxa columns to work with.
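For example, keeping the rank-name column plus a handful of OTU columns (the exact number is arbitrary):

```python
hmp_taxa_meta_sub = hmp_taxa_meta.iloc[:, :6]   # all 7 rows, first 6 columns
hmp_taxa_meta_sub
```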


3.8.3 Fix column names and remove unwanted data

We've imported and taken a look at our data, and subsetting was quite straightforward.

Recall our OTU-taxa file plan:

  1. Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset? $\checkmark\checkmark$

  2. Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it. $\checkmark\checkmark$

    • There are only 7 rows so we'll keep all of those but we can trim how many columns we are taking for now.
  3. Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed.

    • We'll rename the Unnamed: 0 column to rank_name. This is an extra modification - extra because it is not indispensable.
    • Row index 0 appears to have redundant data to the other 6 rows. We'll discuss this step in a little more detail but the route we'll take is to remove this information.
  4. Convert your dataset into long format to help conform with tidy data principles.

3.8.3.1 Rename column 0 of hmp_taxa with the rename() method

Renaming columns in our DataFrame is pretty straightforward but does require some thought. Renaming a single column is normally a simple case, but what happens if we want to rename multiple columns? The pandas package lets us rename multiple DataFrame indices (rows) OR columns with the rename() method, but we should take careful note of the parameters:

Let's start by renaming that column.
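Using the columns mapping:

```python
hmp_taxa_meta_sub = hmp_taxa_meta_sub.rename(columns={"Unnamed: 0": "rank_name"})
hmp_taxa_meta_sub.head()
```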


3.8.3.2 Remove row 0 from hmp_taxa_meta_sub

So we now have a subset of our data and we've renamed that first column to rank_name. It's time to remove that first row from our dataset.

From hmp_taxa_meta_sub let us remove row 0 (consensus lineage). The OTU-taxa dataset is redundant: row 0 (consensus lineage) contains the same information as rows 1 to 6. This redundancy means that we have two alternatives to convert taxonomic ranks into columns and OTUs into a single column (as we did with hmp_data):

  1. Use only row 0, split the contents of each cell based on special characters such as colons and underscores, use the single letters as taxonomic ranks (e.g. c stands for "class"), and then melt and populate the dataset with the contents of the OTU columns. This approach requires the use of regular expressions (regex), and we are not quite there yet. We will revisit this same exercise in week 6 (regex) and will use this option to format this same dataset.

  2. Remove "consensus lineage" from the dataset and use the remaining six rows and all OTU columns. This alternative is more suitable for our Python programing skills at this point so that is what we will use.

Remember how we used the drop() method to remove a column from hmp_data? Now we can go ahead and use the same method to drop a row as well. By default it will assume you are referring to row indices (axis = 0) when you provide an index as an integer.
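So the removal is a single call:

```python
hmp_taxa_meta_sub = hmp_taxa_meta_sub.drop(0)   # axis=0 by default, so 0 is read as a row label
```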


3.8.4 Reshape our OTU taxa information

We've now completed some data reformatting and trimming and we're onto the final step!

  1. Import and explore the file to identify issues that might be present with your data. What general problems may arise as you wrangle this dataset? $\checkmark\checkmark$

  2. Create a subset of your data. Sometimes you are working with VERY large datasets. The process of wrangling a large file can take a lot of processing power and time. You don't want to spend your time experimenting on the entire dataset to come up with a customized procedure to wrangle it. $\checkmark\checkmark$

    • There are only 7 rows so we'll keep all of those but we can trim how many columns we are taking for now.
  3. Remove unwanted data. Some columns or rows may be irrelevant to your analysis or unusable due to missing data discovered in step 1. Eliminate these as needed. $\checkmark\checkmark$

    • We'll rename the Unnamed: 0 column to rank_name. This is an extra modification - extra because it is not indispensable.
    • Row index 0 appears to have redundant data to the other 6 rows. We'll discuss this step in a little more detail but the route we'll take is to remove this information.
  4. Convert your dataset into long format to help conform with tidy data principles.

    • To get our data into its final form we'll want to start by melting all of the OTU taxonomic information into two columns: OTU and taxonomic_level.
    • Afterwards we'll want to spread the dataset back out into columns based on rank_name.

3.8.4.1 Melt hmp_taxa_meta_sub into a 3-column dataset

There is only one column, rank_name, that we need to keep as an identifier. All of the other columns are OTUs that we'll want to melt into OTU and taxonomic_level.

What will our melted DataFrame look like?

rank_name OTU taxonomic_level
SUPERKINGDOM OTU_97_1 Bacteria
PHYLUM OTU_97_1 Firmicutes
CLASS OTU_97_1 Bacilli
ORDER OTU_97_1 Lactobacillales
FAMILY OTU_97_1 Lactobacillaceae
GENUS OTU_97_1 Lactobacillus
... ... ...



Let's melt hmp_taxa_metadata_subset and create a new dataframe called hmp_taxa_metadata_subset_melt.
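A sketch of that melt, starting from the subset prepared above (here still named hmp_taxa_meta_sub):

```python
hmp_taxa_metadata_subset_melt = hmp_taxa_meta_sub.melt(
    id_vars="rank_name",            # the only identifier column
    var_name="OTU",                 # old OTU column names
    value_name="taxonomic_level",   # old cell values (the taxonomic assignments)
)
hmp_taxa_metadata_subset_melt.head()
```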


3.8.4.2 Redistribute your data with the pivot() method

Looking at our melted data, we have now converted the OTU columns into rows. Taking a closer look, however, is the information in a format that we want to work with? When we think of the identifier information from our other table, hmp_data_tidy, it is hard to imagine how we'll combine the tables (our ultimate goal).

Furthermore, keeping taxonomic ranks as rows will bring difficulties later on when we try to access the data for statistics and plotting. This format also generates six rows per OTU (six taxonomic ranks) instead of a single row per OTU. You can see that what are essentially identifiers of the same "observation" are therefore spread across 6 rows. This doesn't exactly follow our philosophy on tidy data.

Instead, let's convert the information in rank_name to column names, which will convert the redundant OTUs column values into a column of unique values, with the taxonomic_level values filling out the new columns made from rank_name.

To accomplish our conversion goals, we'll use the pivot() method which has the following relevant parameters:

reshaping_pivot.png
We reshape our long-format data to a wide-format which will represent each OTU as a single observation with multiple variables as the different taxonomic levels. This is accomplished with the pivot() method. This image was provided by the Pandas online documentation.

First, recall that we can directly call on column names as though they are attributes/properties of the DataFrame. As a sanity check, let's see what our new column names would be.


3.8.4.3 Choose your .pivot() variables carefully

We have 6 unique rank_name values, and each will become its own column once we .pivot() the melted data frame, in addition to the OTU column. Use the pandas DataFrame .pivot() method on the hmp_taxa_metadata_subset_melt dataframe, making sure that the OTU values become the index for the new dataset - this is key and will be used when joining the two datasets.
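In code, something along these lines (the result name is a placeholder):

```python
hmp_taxa_pivot = hmp_taxa_metadata_subset_melt.pivot(
    index="OTU",                # one row per OTU; this index is what we will join on later
    columns="rank_name",        # SUPERKINGDOM, PHYLUM, ... become the new column names
    values="taxonomic_level",   # the cell contents
)
hmp_taxa_pivot.head()
```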

Index and Column combinations: When generating pairs of index/column combinations, note that these combinations must be unique. If for some reason your combination results in a non-unique observation being generated (ie two entries for OTU_97_1 under the CLASS variable), then an error will be returned.

3.8.4.4 The pivot() method has no knowledge of your column order

As we can see from above, we did manage to spread our data out but the columns have been placed by alphabetical order. While that may not be a huge deal for Python when working with the data in later steps, our human brains betray us. We'd much rather have this information in the taxonomic order that we're used to.

To accomplish this we can reorder our DataFrame based on either an explicitly coded or data-generated list. Recall that using the [ ] indexing operation with a list will return a DataFrame with the columns in the order specified.
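For example, an explicitly coded list of ranks in the usual taxonomic order:

```python
rank_order = ["SUPERKINGDOM", "PHYLUM", "CLASS", "ORDER", "FAMILY", "GENUS"]
hmp_taxa_pivot = hmp_taxa_pivot[rank_order]   # indexing with a list returns the columns in that order
```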

A quick note about the pivot conversion. You can see that the column names are returned in an Index object whose name is "rank_name" and the row indices are returned in an Index object named "OTU". This gives us the output we see when we call on the head() method of our pivoted DataFrame.


3.8.5 Combine our code to format the hmp_taxa_meta dataset

Now we have our dataset in the way we wanted it. Let's apply the code to the entire dataset so we can finally join the two datasets.


Notice that the dataset does not necessarily have 999 unique OTUs. We have 999 rows because some OTUs have the same taxonomic assignment after ignoring NaN values (this is explained by how OTUs are created using similarities against a consensus sequence). In fact, if we only look at the GENUS variable we have 85 unique genera.

We can use the count() method to get a summary of non-NaN values within each column of our DataFrame as well. This is a useful way to also sanity check your data for completeness.
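Applied to the tidied taxa table (the name hmp_taxa_meta_tidy follows the later sections):

```python
hmp_taxa_meta_tidy.count()   # non-NaN entries per column; compare these against the total number of rows
```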


Section 3.8.5 Comprehension Question: We state above that there are 85 unique genera. Yet using the count() method we see there are 86 values. What is the reason for this discrepancy? Generate a code snippet to check how many different Genera we actually have.

Why are our unique() and count() values different?

Answer:


4.0.0 Merging data

Into the final stretch, now that we have our data in a tidy format, it is time to merge it together. This can also be referred to as joining our data. In order to successfully do so, we need at least one column from each DataFrame that we can use as a reference to combine the data. These columns can also be referred to as keys.

In our case, we will use the unique OTU identifiers as a key to join the two datasets by matching their indices. Pandas does a much better job of merging tables for us than a copy-paste approach in spreadsheet software, where it is easy to mess up your dataset and ultimately negatively affect your results. In our case, the data:metadata relationship has many data observations linked to a single metadata observation.

There are several ways to join two datasets with the Pandas package:


4.1.0 Generate subsets of our tidy data for the joining process

Recall that both of our tables share OTU designations for each observation. We can take advantage of this to merge the data between the two datasets. First, however, we'll practice again on smaller subsets of our tidied data.

4.1.1 Test for values in a DataFrame with the isin() method

Before we proceed with our example, we will subset the datasets in such way that they both have the same OTUs so we can get them to match. The same 10 OTUs in each dataset will be enough for our example. The first task is to select 10 OTUs from either file and then use those same 10 OTUs to subset both data frames.

The first step of choosing some OTUs from hmp_taxa_meta_tidy is easy. To check for entries that exist in hmp_data_tidy we need to compare the correct column of information, and see if each element of that column is contained within our set of OTU values. The isin() method will return a DataFrame of booleans on whether any elements in our DataFrame object are contained within the values we supply.
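A sketch of that check, assuming the tidied tables are named as in the surrounding text:

```python
otus_10 = hmp_taxa_meta_tidy.index[:10]   # ten OTU identifiers taken from the taxa table's index
hmp_data_tidy["OTU"].isin(otus_10)        # one boolean per row of hmp_data_tidy
```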


4.1.2 Use boolean indexing to subset your DataFrame

Now that we have a list of 10 OTUs to test with, recall that we can use the boolean values from isin() as a way to index data from our DataFrame objects. Recall that when we directly index with [ ] on a DataFrame, this returns all rows whose positions are marked as True. This is a short-hand version of using either the .loc or .iloc methods when working with data rows.
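Continuing the sketch, the two practice subsets might be built like this:

```python
hmp_data_10_otus = hmp_data_tidy[hmp_data_tidy["OTU"].isin(otus_10)]
hmp_taxa_10_otus = hmp_taxa_meta_tidy[hmp_taxa_meta_tidy.index.isin(otus_10)]
```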

Both datasets have the same OTUs.


4.2.0 Use merge() to combine our practice subsets

Let's take a closer look at the merge() function which has some of the following parameters

Let's combine our two subsets, hmp_data_10_otus and hmp_taxa_10_otus, now that we've ensured they share the same keys. Note that this isn't strictly necessary: each join type in the how parameter deals with merging observations based on keys in a different manner.

Join type Description
left Keep everything from the left DataFrame and only merge intersecting keys from the right DataFrame. Missing keys are filled with NaNs
right Keep everything from the right DataFrame and only merge intersecting keys from the left DataFrame. Missing keys are filled with NaNs
outer AKA a full outer join, creates the union of keys and merges based on these. Unmatched cases are filled with NaNs
inner Creates the intersection of the keys from left and right and merges based on these

The next step is to join the two datasets by a common column or index. If we decide to do this by column, the column(s) that we use as a key must have the same name in both datasets. If we decide to go with the indices as the key for the merge, we just set the merge() arguments left_index and right_index to True.

In summary, we either set the hmp_data_tidy OTU column as index using the set_index() method OR make hmp_taxa_meta_tidy's index into a column called OTU.

Check it out: Read more about the wonderful world of merging here.

Let's merge the taxa metadata into our actual data set based on their shared index values.

4.2.1 Use the set_index() to reassign a variable to the index

While you can use the .index property to set the values of the index in your DataFrame, this requires that you supply values in some form, and it does nothing else to the DataFrame. If you find yourself in a situation where you would like to convert one of your columns into the index, you could use the .index property and then the drop() method to remove the now-redundant column. The set_index() method, however, lets us do this all at once by simply supplying the name of the column that should replace the index.
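Putting the two pieces together, a sketch of the index-based merge of our practice subsets (the result name is a placeholder; merge() defaults to an inner join):

```python
hmp_merged_10 = pd.merge(
    hmp_data_10_otus.set_index("OTU"),   # promote the OTU column to the index...
    hmp_taxa_10_otus,                    # ...so both sides can be matched on their indices
    left_index=True,
    right_index=True,
)
hmp_merged_10.head()
```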


4.3.0 Apply the final code to the "full" datasets

As simple as that! Now we have a single dataset so we achieved our last goal. Time to apply the code to the full datasets.

Caution when merging large datasets: this code will not run on your computer with the large datasets (there is no problem with the subsets) - it will crash. Merging the two files results in a file that is 20 GB in size and has 12 million rows. If we started with two files, one 251.2 MB and the other 8.4 MB, how is it possible that we end up with a 20 GB plain text file? Ultimately, in the process of reshaping our data, we generate redundant information across observations of different variables (OTUs).

Section 4.3.0 Comprehension Question: Notice the numbers above? We started with 1,991,000 rows of data in our long-format hmp_data DataFrame. After merging we have only 999,000 rows. Is there an alternative set of merge parameters that will retain our original data regardless of whether OTU metadata is available?

Answer:


4.3.1 You will always be wrangling data in one way or another

Hopefully you got the hang of what data wrangling (preprocessing) is about and why it often represents 80 to 90% of your data analysis time. The next time you start planning your experiment, it is advisable to also plan ahead how you will record your data so you do not have to spend too long formatting it.

As a bonus, if you are mining data or repeating experiments and always getting the same kind of data format, then once you have built these kinds of scripts you can reuse them!


4.4.0 Write your data to file with the to_csv() method

The pandas DataFrame class has a built-in method for exporting our data to various file formats, like the comma- or tab-separated value formats. We can actually just use the to_csv() method for both file types because we can alter some of the parameters during the call:
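For example (output paths and the merged DataFrame name are placeholders):

```python
hmp_merged.to_csv("data/hmp_merged.csv")             # comma-separated by default
hmp_merged.to_csv("data/hmp_merged.tsv", sep="\t")   # switch the separator for a tab-separated file
```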

Read more: You can go through more parameters by checking out the ?pd.DataFrame.to_csv command or going to the pandas documentation here.

4.4.1 Write to an excel file with the to_excel() method

Similar in idea to the to_csv() method, this will allow us to directly write our DataFrame objects as Microsoft Excel files. We just need to set the proper parameters:
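Along the lines of (again with placeholder names):

```python
hmp_merged.to_excel("data/hmp_merged.xlsx", sheet_name="hmp_merged", index=True)
```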

Read more: You can go through more excel-writing parameters by checking out the ?pd.DataFrame.to_excel command or going to the pandas documentation here.

5.0.0 Class summary

That's our third class on Python! You've made it through, and we've learned about working with data files:

  1. The tidy data philosophy.
  2. Reading in .xlsx and .csv file formats.
  3. Strategies for data wrangling - subsets, removing data, merging tables.
  4. Writing your wrangled data to a file.

5.1.0 Submit your completed skeleton notebook (2% of final grade)

At the end of this lecture a Quercus assignment portal will be available to submit your completed skeletons from today (including the comprehension question answers!). These will be due one week later, before the next lecture. Each lecture skeleton is worth 2% of your final grade but a bonus 0.7% will also be awarded for submissions made within 24 hours from the end of lecture (ie 1700 hours the following day).

5.2.0 Post-lecture DataCamp assessment (8% of final grade)

Soon after the end of each lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete chapters 1-3 of Data Manipulation with pandas (Transforming Data, 950 possible points; Aggregating Data, 1300 possible points; and Slicing and Indexing, 1150 possible points). This is a pass-fail assignment, and in order to pass you need to achieve at least 2550 points (75%) of the total possible points. Note that when you take hints from the DataCamp chapter, it will reduce your total earned points for that chapter.

In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course. It should look something like this:

DataCamp.example.png
A sample screen shot for one of the DataCamp assignments. You'll want to combine yours into single images or PDFs if possible

Submit the file(s) for the homework to the assignment section of Quercus. This allows us to keep track of your progress while also producing a standardized way for you to check on your assignment "grades" throughout the course.

You will have until 13:59 hours on Thursday, February 3rd to submit your assignment (right before the next lecture).


5.3.0 Acknowledgements

Revision 1.0.0: materials prepared by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.1.0: edited and prepared for CSB1021H S LEC0140, 06-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.2.0: edited and prepared for CSB1021H S LEC0140, 01-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


5.4.0 Resources


6.0.0 Appendix 1: the openpyxl package

6.1.0 Use the openpyxl package to open excel files

In earlier years the xlrd package could be used to read Excel spreadsheet files. However, in the past decade Microsoft has focused on the .xlsx format, which xlrd supported until its development officially dropped support for anything but the older .xls format in December 2020.

Now the official recommended package is openpyxl to open the more complex excel files. You can also use this package to make and save excel spreadsheet files so it's a good one to keep in your toolbox.

Let's start by importing the file miscellaneous.xlsx from the data directory by calling on the load_workbook() function.
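The import itself is a single call:

```python
from openpyxl import load_workbook

wb = load_workbook("data/miscellaneous.xlsx")   # returns a Workbook object
```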


6.1.1 Look at sheet names with .sheetnames property

Now that we've generated this Workbook object, we can look at some of its properties, like the names of the various sheets using the .sheetnames property. Remember this object holds data in a complex hierarchy that we'll have to break down into something like a DataFrame object.

From there we can access individual sheets using the ["sheetname"] indexing operators.
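For example:

```python
wb.sheetnames               # list of sheet names in the workbook
ws = wb[wb.sheetnames[0]]   # index with a sheet name to get a Worksheet object
```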


6.1.2 Access the values of a worksheet with the .values property

Now that we have pulled out a Worksheet object, we have to imagine that there are values there for us to use. The way to access this information on the various cells is with the .values property. We'll see, however, that it's not exactly a data structure like we imagine it.
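A quick look at what .values actually gives us:

```python
ws.values                          # a generator object, not a list or array
df_raw = pd.DataFrame(ws.values)   # this works, but the header row lands in row 0 of the DataFrame
df_raw.head()
```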


6.1.3 Iterate through a generator with the next command

As you can see from above, we managed to convert our worksheet data into a DataFrame but it wasn't exactly a perfect conversion. Looking at the original object we get from the .values property, we get a generator object in return.

A generator is a special function that can make its way through an iterable object (like a list or dictionary), returning a sequence of values rather than just a single value. It also pauses (yields) its execution wherever it is in the iterable object and resumes from that state the next time you ask for a value. Sounds confusing, right?

Simply put, this kind of function serves as an iterator for moving through the rows of our sheet value data one at a time using the next() command. This is helpful to know because looking closer at our DataFrame we see that the header information was deposited into the first row of our DataFrame rather than as column names. There are a few ways to remedy this but one way is to properly iterate through our generator with the next() command.

By default the next() command on our Worksheet.values generator will return a row from our values. Once it gets to the end of the structure, calling next() will raise a StopIteration error.

Let's take a look ourselves.
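A sketch of the behaviour discussed below:

```python
rows = ws.values       # grab one generator and keep a handle on it
pd.DataFrame(rows)     # building the DataFrame consumes every row...
next(rows)             # ...so asking for one more raises StopIteration
```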

Why did we get the stop iteration? When we assigned it to a DataFrame above, it iterated through the whole .values generator and now there's nothing else to squeeze from it. We'll have to start over again.


6.1.4 The next() function returns a tuple

That's right, our good friend the tuple is back and therefore we can slice it with the [] operators. That means we can pull out specific "columns" that we're interested in from each row.


6.1.5 Use next() to extract a header for a DataFrame

Now that we know the inner workings a little more, we can grab the first row from the Worksheet.values generator and keep that for our columns. Then we can pass the rest of the generator to the initialization of the DataFrame. Let's see that in detail.
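In sketch form:

```python
rows = ws.values
header = next(rows)                       # the first yielded row: the column names (a tuple)
df = pd.DataFrame(rows, columns=header)   # the remaining rows become the data
df.head()
```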


The Centre for the Analysis of Genome Evolution and Function (CAGEF)

The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.

From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.

For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.

CAGEF_new.png